[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API by davies · Pull Request #3091 · apache/spark

davies · 2014-11-04T17:34:46Z

pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is a Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or against the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values.)

    If `observed` is a matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
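
For readers who want to see the arithmetic the docstring describes, here is a minimal pure-Python sketch. It is not the MLlib implementation (that runs on the JVM side). The function names `goodness_of_fit`, `independence`, and `contingency_from_pairs` are hypothetical, and the p-value is left out because it needs a chi-squared CDF that the standard library does not provide.

```python
from collections import Counter


def goodness_of_fit(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a vector of counts."""
    if any(x < 0 for x in observed):
        raise ValueError("observed cannot contain negative values")
    n = len(observed)
    if expected is None:
        # Uniform default: every category gets a 1 / len(observed) share.
        expected = [1.0 / n] * n
    # Rescale `expected` so that it sums to the same total as `observed`.
    scale = float(sum(observed)) / sum(expected)
    stat = sum((o - e * scale) ** 2 / (e * scale)
               for o, e in zip(observed, expected))
    return stat, n - 1


def independence(matrix):
    """Return (statistic, degrees_of_freedom) for a contingency matrix."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    if any(v < 0 for row in matrix for v in row):
        raise ValueError("matrix cannot contain negative entries")
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("rows and columns must not sum to 0")
    total = float(sum(row_sums))
    stat = 0.0
    for i, row in enumerate(matrix):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_sums[i] * col_sums[j] / total
            stat += (o - e) ** 2 / e
    return stat, (len(row_sums) - 1) * (len(col_sums) - 1)


def contingency_from_pairs(pairs):
    """Count (feature, label) pairs; each distinct value is one category."""
    counts = Counter(pairs)
    features = sorted({f for f, _ in pairs})
    labels = sorted({l for _, l in pairs})
    return [[counts[(f, l)] for l in labels] for f in features]
```

For the RDD-of-LabeledPoint case, the idea would be to build one such contingency matrix per feature column and pass each to the independence computation.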

davies · 2014-11-04T17:34:57Z

cc @mengxr

SparkQA · 2014-11-04T17:39:47Z

Test build #22882 has started for PR 3091 at commit 5097d54.

This patch merges cleanly.

SparkQA · 2014-11-04T18:48:40Z

Test build #22882 has finished for PR 3091 at commit 5097d54.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Matrices(object):
- class ChiSqTestResult(JavaModelWrapper):

AmplabJenkins · 2014-11-04T18:48:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22882/

SparkQA · 2014-11-04T18:57:39Z

Test build #22886 has started for PR 3091 at commit 0ab0764.

This patch merges cleanly.

SparkQA · 2014-11-04T20:19:58Z

Test build #22886 has finished for PR 3091 at commit 0ab0764.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Matrices(object):
- class ChiSqTestResult(JavaModelWrapper):

AmplabJenkins · 2014-11-04T20:20:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22886/

mengxr · 2014-11-05T03:02:05Z

python/pyspark/mllib/common.py

What happens if r is JavaArray or JavaList but not pickleable? Are we expecting that downstream can handle it?

davies (reply): The caller will handle it. A JavaArray/JavaList is iterable in Python, so the caller can access the objects it contains.

SparkQA · 2014-11-05T03:22:34Z

Test build #22913 has started for PR 3091 at commit 145d16c.

This patch merges cleanly.

SparkQA · 2014-11-05T04:47:59Z

Test build #22913 has finished for PR 3091 at commit 145d16c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Matrices(object):
- class ChiSqTestResult(JavaModelWrapper):

AmplabJenkins · 2014-11-05T04:48:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22913/

mengxr · 2014-11-05T05:36:44Z

LGTM. Merged into master and branch-1.2. Thanks @davies !

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API

(cherry picked from commit c8abddc)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

add Hypothesis test Python API

5097d54

davies changed the title ~~[SPARK-3694] [MLlib] [PySpark] add Hypothesis test Python API~~ [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API Nov 4, 2014

fix float

0ab0764

mengxr reviewed Nov 5, 2014

address comments

145d16c

asfgit closed this in c8abddc Nov 5, 2014
